Concept matching system

ABSTRACT

A system for retrieving documents related to a concept from a text corpus includes a set of stored semantic classes which are combinable to express the concept, each class including a set of keywords, each set of keywords including at least one keyword. Syntactic rules are applied to identified text portions which include one or more of the keywords. A rule is satisfied when keywords from a first and a second of the semantic classes are in any one of a plurality of syntactic relationships. A concept matching module identifies text portions within the text corpus which include one or more of the keywords, applies the syntactic rules to the text portions, and identifies those text portions which satisfy at least one of the rules. Documents to be retrieved may include at least one of the identified text portions.

CROSS REFERENCE TO RELATED PATENTS AND APPLICATIONS

Reference is made to copending application Ser. No. ______, filed contemporaneously herewith, entitled CONCEPT MATCHING, by Agnes Sándor and Aaron Kaplan [hereinafter Sándor and Kaplan], the disclosure of which is totally incorporated herein by reference.

BACKGROUND

The present exemplary embodiment relates generally to document processing. It finds particular application in conjunction with a protocol for identifying text which relates to a concept, such as possible use of a new product advertised in a press release, and will be described with particular reference thereto. However, it is to be appreciated that the present exemplary embodiment is also amenable to other like applications.

Database and Internet searching are widely used for retrieving documents that are relevant to the information needs of a user. Many document processing problems involve finding text passages which express a given concept. Examples include information retrieval (IR), information extraction (IE), and question answering (QA) problems. Some concepts are generally expressed by a small set of fixed words or expressions and thus are relatively easy to detect. For example, they may be detected automatically, using a simple keyword search or a set of regular expressions. Other concepts are more difficult to detect because their expressions are more varied. The problem stems from the fact that language is both productive and ambiguous: the same concept can be expressed by infinitely many expressions and, at the same time, the words making up the expressions can have different meanings in other contexts. Keyword searching is thus generally not applicable for concepts conveyed by a wide range of linguistic expressions.

Existing document processing systems typically deal with the productive nature of language by allowing substitution of similar expressions or expression schemata, where similarity is defined such that if one expression is relevant to a user's query, then similar ones can be assumed to be relevant as well. These similarities are usually defined using three levels of linguistic information: morphological equivalences, syntactic equivalences, and lexical semantic equivalences. For morphological equivalences, a morphological processing component can detect word forms that are inflected or derived from the same root, e.g.: X acquires Y; X acquired Y; and the acquisition of Y by X. For syntactic equivalences, a system can detect similarity between pairs of expressions such as: X acquired Y; and Y was acquired by X, by using specified syntactic rules. For lexical semantic equivalences, a system can be provided (by hand or using corpus statistics) with information about various semantic relationships, e.g. synonymy or hyponymy, among lexical units, and this information can be useful for detecting similarities such as: X acquired Y; and X bought Y.

By using generic linguistic resources of the above sorts, a system can allow users to specify a single search pattern that matches a range of different expressions. For example, using currently available linguistic resources, a user searching for descriptions of acquisitions could conceivably write a single pattern that matched all of the above descriptions of transactions. However, the range of expressions that convey a given concept may be even greater. For example, simple morphological, lexical, and syntactic substitutions would not be sufficient to match that same pattern with expressions such as:

Y will end up being owned by X

X became the new owner of Y

Some concepts tend to be expressed in relatively limited ways, and for such concepts, reasonable coverage and precision can be attained using the standard types of linguistic resources described above, if not with a single query pattern then with relatively few of them. The example of commercial transactions appears to be such a concept: in most corpora, the majority of commercial transactions can be matched with one of the patterns “X sell Y” or “Z buy Y” using the standard types of resources. For this reason, the problem of finding purchases of companies is often used as an example in the IE literature. But for other concepts, the number of patterns needed becomes unmanageable.

The challenge, then, is to provide a way of expressing patterns that generalize over the types of variation that cannot be accounted for using traditional morphological, syntactic, and lexical semantic resources. There has been a great deal of theoretical work on the sort of linguistic resources that would be necessary to do this in a general way. For example, the ideal semantic lexicon would contain, in a machine-processable form, the information that a purchase involves a change of ownership, and that an expression following “end up” is the resultant state of a change. A system endowed with such a lexicon could conceivably infer that “Y will end up being owned by X” is likely to indicate a purchase of Y by X. But in practice, the information that could be considered for inclusion in such a lexicon is boundless, and the process of encoding it is time-consuming and error-prone, so practical, general-purpose results do not appear to be forthcoming.

Text mining in scientific literature is currently an active area of research. The focus of such research is primarily on extracting domain information, e.g., protein function or chemical pathways. These methods produce (and some then synthesize) a large volume of information, with a small amount coming from each of many documents.

REFERENCES

-   Salton, G., “The SMART Information Retrieval System” (Prentice Hall, Englewood Cliffs, N.J. (1971)) discloses a searching system that uses vector-space representations.

-   U.S. Pat. No. 6,246,977 to Messerly, et al., discloses information retrieval utilizing semantic representation of text and constrained expansion of query words.

-   U.S. Pat. No. 6,678,677 to Roux, et al., discloses a method for information retrieval using a semantic lattice.

-   Teufel, Simone, “Meta-discourse Markers and Problem-Structuring in Scientific Articles,” in ACL Workshop on Discourse Structure and Discourse Markers, pp. 43-49 (1998) discloses searching scientific literature for meta-discourse markers that indicate rhetorical structure, rather than for domain information. In Teufel's approach, the search patterns are simply word and punctuation strings (n-grams) of length two to six, and only n-grams that are frequent in the training corpus are considered.

-   Zhang, Y., et al., “Novelty and Redundancy Detection in Adaptive Filtering,” Proc. ACM SIGIR, pp. 81-88 (2002) discusses the problem of “novelty detection,” in which the goal is to decide whether or not a given sentence or document contains information that was not already expressed in previous documents.

-   Allan, J., et al., “First Story Detection in TDT is Hard,” Proceedings of the Ninth International Conference on Information and Knowledge Management, pp. 374-381 (2000) discusses the problem of “first story detection,” in which the object is to find sets of newspaper articles that describe the same event, in order to identify the earliest report of each event.

BRIEF DESCRIPTION

Aspects of the present disclosure in embodiments thereof include a system and a method for retrieving documents related to a concept from a text corpus. The system may include memory which stores a set of semantic classes which are combinable to express the concept and a set of keywords for each of the semantic classes to be used in searching documents in the text corpus, each set of keywords including at least one keyword, and a plurality of syntactic rules to be applied to identified text portions which include one or more of the keywords. Each of the syntactic rules identifies a first of the semantic classes and a second of the semantic classes. A rule is satisfied when a keyword from the first of the semantic classes and a keyword from the second of the semantic classes are in any one of a plurality of syntactic relationships. A concept matching module identifies text portions within the text corpus which include one or more of the keywords, applies the syntactic rules to the text portions, and identifies those text portions which satisfy at least one of the syntactic rules. Documents can thus be retrieved which include at least one of the identified text portions.

In another aspect, a computer implemented system for retrieving documents related to a concept from a text corpus includes a component for labeling selected keywords in the text corpus, a component which associates the labeled keywords with a semantic class, each of the semantic classes including a plurality of the keywords, a component which labels pairs of keywords of selected semantic classes which are in any one of a plurality of syntactic relationships, and a component which identifies documents which include a labeled pair of keywords.

The method for retrieving documents related to a concept from a text corpus includes labeling keywords in the text corpus which belong to at least one of a plurality of predefined semantic classes, labeling pairs of labeled keywords which are in any one of a plurality of syntactic relationships and which meet one of a plurality of predefined syntactic rules, and labeling documents which include at least one labeled pair. At least a portion of the documents which include at least one labeled pair are retrieved.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of a concept matching system; and

FIG. 2 is a flowchart for an exemplary process for concept matching in a text database using developed keywords, constituent notions, and co-occurrence syntax rules.

DETAILED DESCRIPTION

A system for concept matching in a text corpus, such as a collection of press releases about new products, is suited to identifying documents representing a concept, such as, for example, possible uses of the new product (“possible use”). The term “document” as used herein may comprise a portion of a document, such as an article abstract, or an entire document. Each document includes at least one text portion, and generally a plurality of text portions. Each text portion is typically a sentence, although it is contemplated that a text portion may be a portion of a sentence, such as a clause or portion in quotations. In the case of a press release, for example, there are typically on the order of five to ten sentences per document.

One exemplary concept matching system addresses the need to identify portions of press releases that inform readers about the possibilities of using the new product advertised, so that a user can identify those products relevant for the user's desired purposes. In aspects of the exemplary embodiment, the present system identifies possible uses by using lexico-syntactic pattern matching rules. The rules may be developed in the framework described in Sándor and Kaplan, the disclosure of which is totally incorporated herein by reference.

As illustrated in FIG. 1, an exemplary concept matching system 1 may consist of the following components:

1. A document database 10 that stores the documents to be searched, and facilitates their retrieval for further processing (or a suitable wired or wireless link 12 for accessing such a text corpus and retrieving documents therefrom).

2. A concept matching module 14 that identifies expressions that describe possible uses.

3. A ranking module 16 that ranks documents according to a scheme that takes into account both the results of the concept matching by the concept matching module 14, and (optionally) relevance to a user-specified topic.

4. A memory 18 which stores a set of preselected keywords 20 to be used in text searching.

The preselected keywords will be referred to herein as stored keywords or simply as keywords. The document database or other text corpus 10 to be searched is accessible to the concept matching module 14. The database may be provided on a hard disk of a computer system or on a particular storage medium such as CDs, DVDs, or other digital storage media. Alternatively, the database may be stored at a remote location connected to the computer system via a data transmission network. Any document storage system that allows keyword-based searches to be performed on a set of documents could alternatively be used. Alternatively or additionally, a system that allows indexing on syntactic information, such as that disclosed by U.S. Pat. No. 6,678,677 to Roux et al., may be employed. Such a system can reduce the number of documents that must be processed by the concept matching module, which tends to speed up query-time processing. The Roux patent is incorporated herein in its entirety by reference. It will be appreciated that the memory 18 may be incorporated into the concept matching module 14 rather than being a separate component, as illustrated.

As illustrated in FIG. 2, an exemplary process for identifying documents includes identifying preselected stored keywords 20 in the text corpus which belong to at least one of a plurality of predefined semantic classes 22 (Step S100) and labeling them (Step S110), identifying pairs of labeled keywords which are in any one of a plurality of syntactic relationships (Step S120), and labeling them (Step S130), identifying labeled pairs which meet one of a plurality of predefined syntactic rules 24 and labeling sentences containing them (Step S140), labeling documents which include at least one sentence containing a labeled pair (Step S160) and retrieving at least a portion of the documents which include at least one sentence containing a labeled pair (Step S170). Further steps may include ranking the documents according to relevance (Step S180) and outputting a list of documents ordered by rank (Step S190).

Development of a Concept Matching System

Sándor and Kaplan discloses a methodology for developing systems that find concepts in natural language text. The method uses syntactic dependencies and task-specific semantic classes. The method is particularly suited to developing a system which, with a reasonable assurance of accuracy, uncovers all or at least a preponderance of text/documents considered relevant to the particular search undertaken. In particular, the methodology develops a concept matching system which evaluates a set of documents which form all or a selected portion of a database. The methodology for developing a system for finding a selected target concept in a text corpus can be generally described as including the following operations:

1. Identifying a set of task-specific semantic classes (“constituent notions”) which can be combined to express the target concept.

2. Identifying a list of keywords for each of the constituent notions to be used in text searching in a text corpus, such as a document database.

3. Establishing a set of patterns (“syntactic rules”), involving syntactically related keywords of the constituent notions identified in step 1, which are applied to text uncovered in the text searching.

Further operating steps may include:

4. Evaluating the methodology on a test text corpus, such as a subset of the documents in the database.

5. Repeating steps 1-4, as appropriate, to refine the constituent notions, keyword lists, and set of syntactic rules in order to improve the retrieval performance of the system.

Retrieval performance of the concept matching system can be defined in terms of recall and/or precision. Recall is the fraction of all relevant documents that are retrieved. Precision is the fraction of all retrieved documents that are relevant.
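By way of illustration only, the two measures defined above may be computed as in the following minimal sketch. The function name and the document identifiers are hypothetical and are not part of the disclosed system.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document identifiers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical identifiers: four documents retrieved, three truly relevant.
p, r = precision_recall(
    retrieved={"d1", "d2", "d3", "d4"},
    relevant={"d2", "d4", "d7"},
)
```

Here two of the four retrieved documents are relevant (precision of one half), and two of the three relevant documents are retrieved (recall of two thirds).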

The words that instantiate a given constituent notion form a sort of semantic class. However, the present semantic classes differ from traditional ones in that their members need not share any of the usual morphological, syntactic or semantic properties. The methodology of Sándor and Kaplan includes developing a set of task-specific semantic classes which can be combined to express a target concept in a search to be performed. These classes will be referred to herein as constituent notions and may be identified by a notion identifier, such as a word or phrase, alphanumeric characters or the like. For each constituent notion within the set of constituent notions that have been identified, a set of keywords which relate to the constituent notion is identified. While particular reference is made to keywords as being single words, the set of keywords may also include hyphenated words, compound words, roots of words (such as acquire or acquir* in place of acquiring, acquired, etc.), and the like. Although the set of keywords may include the notion identifier, as well as synonyms and homonyms of the notion identifier, it need not do so. The set of keywords generally includes words which are related to the constituent notion and, in particular, words which may be used in relevant expressions when the constituent notion is being conveyed. The words selected for a particular constituent notion are thus related to each other primarily in that they relate to the same constituent notion, not in that they have the same meaning, although some words may have the same or similar meanings as other words in the set. It is contemplated that a keyword associated with one constituent notion may also be associated with one or more, but generally fewer than all, of the other constituent notions in the set of constituent notions.
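One possible way of storing keyword sets, including roots such as acquir*, and of labeling the words of a text portion with the constituent notions they may express is sketched below. This is an assumption-laden illustration: the lexicon contents, the notion identifiers, the function name, and the use of shell-style wildcard matching are all hypothetical choices, not the disclosed implementation.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical lexicon: notion identifier -> keyword patterns.
# A trailing "*" marks a root, in the manner of "acquir*".
LEXICON = {
    "USE": ["employ*", "fulfill*", "feature*"],
    "PRODUCT": ["device*", "application*"],
    "POSSIBILITY": ["can", "expect*", "ideally"],
}


def label_keywords(sentence, lexicon=LEXICON):
    """Map each matching word of the sentence to the set of notions it may express."""
    labels = {}
    for token in re.findall(r"[A-Za-z][A-Za-z-]*", sentence.lower()):
        for notion, patterns in lexicon.items():
            if any(fnmatchcase(token, pattern) for pattern in patterns):
                # A word may belong to more than one notion, hence a set.
                labels.setdefault(token, set()).add(notion)
    return labels


labels = label_keywords("The device can be employed in new applications.")
```

Storing a set of notions per word accommodates a keyword that is associated with more than one constituent notion.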

While the set of keywords for a given constituent notion may include as few as one keyword, it is contemplated that several of the constituent notions in the fully developed concept matching system (e.g., at least three, and in one embodiment, at least four constituent notions) will each be associated with at least five or at least ten keywords. Typically, most or all of the constituent notions will be associated with a greater number of keywords, e.g., five, ten or more. The number of keywords in a set is not limited. However, keywords which are found not to improve information retrieval are generally dropped from the set of keywords during development of the concept matching system. During the development of the concept matching system, keywords can be added to or removed from the sets of keywords. Similarly, the number of semantic classes is not limited, but will generally be at least three and may be less than ten.

Various ways for developing the set of keywords are contemplated. These include, for example, analysis of a set of documents which have already been determined to satisfy the search criteria, keywords proposed by an interactive computer program, user evaluation of sentences proposed by a computer processing system in response to input keywords, and the like. The words selected for the constituent notions can be later verified by testing the methodology on a test set of documents and refined, as appropriate. For example, the set of keywords is refined by removing keywords which prove to be overly general and/or adding additional keywords which retrieve additional relevant documents, to increase the overall retrieval performance.

An exemplary methodology involves the development of a set of syntactic rules by which text uncovered by searching for the selected keywords is to be further evaluated for relevance. The rules identify pairs of constituent notions which, when present in text in syntactic relationship to one another, are found or anticipated to improve retrieval performance. Once again, this may be an iterative process. For example, a rule is added to or deleted from the set of rules, or an existing rule is modified, and the new set of rules is tested on a test corpus, such as a subset of the documents in the database, to assess the retrieval performance of the modified rule set.

Syntactic relationships between words in a punctuated natural language text portion can include, for example, the following relationships:

S1 between verb and its subject

S2 between verb and its object

S3 between adjective or adverb and the noun, verb or adjective it modifies

S4 between a noun and another noun that it premodifies

S5 between a noun and a noun or verb that it modifies by prepositional attachment

For computer implementation of the system, an automated parser can be used to identify syntactic relationships in the documents which are selected by means of the search for keywords. The parser examines the text for syntactic relationships and labels the pairs of words involved in the syntactic relationships. The output of the parser is a set of pairs of linked words and, optionally, the syntactic relationship between the linked words in each pair. The number of syntactic relationships which may exist is virtually unlimited and in general, the parser searches for only a limited subset of the possible syntactic relationships. An exemplary parser of this type is described, for example, in U.S. Patent Application No. 2003/0074187, published Apr. 17, 2003, to Ait-Mokhtar, et al., which is incorporated herein in its entirety by reference. For any sentence there may be zero, one or more syntactic relationships identified. Additionally, some keywords may appear in more than one syntactically related pair.

The set of rules developed identifies pairs of semantic classes which are in any one of a plurality of syntactic relationships. In any rule, each of the two related words (such as a verb and its subject) represents one of the constituent notions. For example, for a set of constituent notions N1, N2, N3, . . . Nx and a set of selected syntactic relationships to be detected S1, S2, S3, S4, S5, . . . Sy, where x is the number of constituent notions in the set of constituent notions, and y is the number of syntactic relationships which are accepted, some or all of the constituent notions are incorporated into at least one of the rules developed. For example,

Rule R1={S1, S2, S3, S4, S5, . . . Sy} [N1+N2], i.e., words from semantic classes N1 and N2 are in any syntactic relationship S selected from S1, S2, S3, . . . Sy.

Rule R1 is satisfied for a given text portion when any keyword belonging to constituent notion N1 and any keyword belonging to N2 are in a syntactic relationship S. For example, S may be S2, an object/verb syntactic relationship. In this example, a keyword from N1 may be the verb and the keyword from N2, the object, or vice versa. In its broadest application, the syntactic relationship S between the two constituent notions can be any or all of the syntactic relationships to be detected by the parser. Generally, an automated parser is capable of identifying a fixed set of syntactic relationships at a sentence level. In accordance with aspects of the exemplary embodiment, the syntactic relationships which satisfy a given rule are all of those which the parser to be used by the concept matching system is capable of identifying. In an alternative embodiment, a rule may define a specific subset of the syntactic relationships to be applied by the parser, such as one, two, three, four, six, or eight of the syntactic relationships identified by the parser. In one embodiment, all of the rules apply the same set of syntactic relationships, which, as noted above, may be all of the syntactic relationships which can be identified by the parser.
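The test of whether a single parser output satisfies a rule such as R1 may be sketched as follows. The data structures, field names, and example words are hypothetical; an actual parser would emit its own representation of linked word pairs.

```python
from typing import NamedTuple


class Dependency(NamedTuple):
    relation: str   # e.g. "S1" (verb/subject) or "S2" (verb/object)
    head: str
    dependent: str


class Rule(NamedTuple):
    notion_a: str
    notion_b: str
    relations: frozenset  # syntactic relationships accepted by the rule


def satisfies(rule, dep, notions_of):
    """True if dep links a keyword of notion_a with a keyword of notion_b, in either order."""
    if dep.relation not in rule.relations:
        return False
    head_notions = notions_of.get(dep.head, set())
    dep_notions = notions_of.get(dep.dependent, set())
    return (
        (rule.notion_a in head_notions and rule.notion_b in dep_notions)
        or (rule.notion_b in head_notions and rule.notion_a in dep_notions)
    )


# Hypothetical keyword-to-notion assignments and a rule accepting S1..S5.
notions_of = {"employ": {"N1"}, "device": {"N2"}}
r1 = Rule("N1", "N2", frozenset({"S1", "S2", "S3", "S4", "S5"}))

verb_object = satisfies(r1, Dependency("S2", "employ", "device"), notions_of)
reversed_roles = satisfies(r1, Dependency("S2", "device", "employ"), notions_of)
unknown_relation = satisfies(r1, Dependency("S9", "employ", "device"), notions_of)
```

Checking both orderings reflects the statement that either keyword may fill either role in the relationship. Restricting `relations` to a subset models the alternative embodiment in which a rule accepts only some of the relationships the parser identifies.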

The rules can be written in the same grammar formalism as used by the parser so that the rules are readily applied to the output of the parser. Alternatively, a different formalism is used for the rules.

Each developed syntactic rule may be associated with a weighting. The weighting roughly reflects the anticipated or observed relative importance of the rule in the overall retrieval performance. In its simplest form, all rules may be given the same weighting of 1. In more complex weighting schemes, the rules may each be assigned a weighting W of 0<W≤1. For example, some rules which are considered to strongly improve retrieval performance are assigned a weighting of 1 while others, considered to weakly affect retrieval performance, are assigned a weighting of 0.5. A weighted sum can then be computed: for each rule, the number of times the rule is satisfied within a sentence or other text portion being evaluated (the occurrence rate T) is multiplied by the rule's weighting W, and the products are summed over all the rules.
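The weighted sum described above, i.e., the sum over all rules of T multiplied by W, may be sketched as follows. The rule names and weightings are hypothetical values chosen only to mirror the 1 and 0.5 example.

```python
def weighted_sum(occurrences, weights):
    """Sum of T * W over rules; occurrences maps rule name -> occurrence count T."""
    return sum(count * weights[rule] for rule, count in occurrences.items())


# Hypothetical weightings: one "strong" rule and one "weak" rule.
weights = {"R1": 1.0, "R2": 0.5}

# R1 satisfied once and R2 satisfied twice in the text portion: 1*1.0 + 2*0.5.
score = weighted_sum({"R1": 1, "R2": 2}, weights)
```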

In one embodiment, the concept matching system only retrieves documents in which at least one sentence has a weighted sum of at least 1. For the example above, this is satisfied by one rule with a weighting of 1 or two rules with a weighting of 0.5 in the same sentence. In another embodiment, a more relaxed system retrieves documents having a weighted sum of at least 1 over the document as a whole (which could be satisfied by two rules with a weighting of 0.5 in different sentences).
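The difference between the two embodiments may be illustrated with the sketch below, which reads the first as a per-sentence threshold and the second as a per-document threshold. That reading, the function names, and the rule names are assumptions made for illustration.

```python
def sentence_scores(doc_hits, weights):
    """doc_hits: one dict per sentence, mapping rule name -> occurrence count."""
    return [
        sum(count * weights[rule] for rule, count in hits.items())
        for hits in doc_hits
    ]


def retrieve_strict(doc_hits, weights, threshold=1.0):
    """Retrieve only if some single sentence reaches the threshold."""
    return any(s >= threshold for s in sentence_scores(doc_hits, weights))


def retrieve_relaxed(doc_hits, weights, threshold=1.0):
    """Retrieve if the scores summed over all sentences reach the threshold."""
    return sum(sentence_scores(doc_hits, weights)) >= threshold


weights = {"strong": 1.0, "weak": 0.5}       # hypothetical rule weightings
spread_out = [{"weak": 1}, {"weak": 1}]      # two weak rules, different sentences
same_sentence = [{"weak": 2}, {}]            # two weak rules, same sentence

strict_spread = retrieve_strict(spread_out, weights)
relaxed_spread = retrieve_relaxed(spread_out, weights)
strict_same = retrieve_strict(same_sentence, weights)
```

Under these assumptions, a document whose two weak matches fall in different sentences is retrieved only by the relaxed variant.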

A ranking of the documents can be made. This may be based on their scores, for example, by taking into account the weighted sum or other ranking schemes, such as those based on the occurrence of specific keywords. The ranking of documents may be performed by a ranking module of the system.
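A score-based ranking of this kind may be as simple as the following sketch. The document identifiers and scores are hypothetical, and breaking ties by identifier is an arbitrary choice made so that the ordering is deterministic.

```python
def rank_documents(scores):
    """Return document ids by descending score; ties are broken by id."""
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical weighted sums for three documents.
ranked = rank_documents({"doc_a": 1.5, "doc_b": 3.0, "doc_c": 1.5})
```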

As an alternative to ranking, the system may list all documents which are considered to be relevant. The minimum relevance criteria may be, for example, one pair of constituent notions in syntactic relationship or, where a weighting is applied, a minimum value of the sum of (occurrences x weighting), e.g., a minimum of 1. A minimum of 1 can be obtained, for example, where a “strong” pair of constituent notions (with a weighting of 1) is present or where two “weak” pairs of constituent notions (each with a weighting of 0.5) are present.

An exemplary methodology is outlined below. At each step, it is indicated what kind of automatic support the development environment (e.g., a computer system used for developing the concept matching system) may offer.

1. Find sample expressions that express the desired concept, and extract from them a first set of keywords. This could be supported by a simple keyword query interface: given a keyword or combination of keywords that the user thinks might be useful, the system could provide sample sentences in which that word occurs. The user could label retrieved examples as relevant or not relevant, for use as training data for step 2b below. In one embodiment, a computer processing unit is connected to a user input, such as a keyboard or touch screen. The user inputs a selected keyword and the processing unit proposes sample sentences. The processing unit may alternatively or additionally propose words, for example on a display, such as a screen. The user may select, from the proposed words, those which are considered by the user to be most appropriate for the search. These words are then added to the set of keywords.

2. Iteratively,

    a. By examining the keywords collected so far, manually identify a number of constituent notions that can be combined to express the target concept, and form classes of keywords such that all of the words in a class express the same constituent notion. Each keyword may express one or more constituent notions. This step depends on human intuition and is not readily automated; automatic support may be limited to an interface that facilitates the manual classification.

    b. Write rules indicating which combinations of constituent notions can be combined to express the desired concept, and how these combinations can be realized syntactically. Examples labeled in step 1 can be parsed automatically to give suggestions for such rules that the user can validate or reject. For combinations of keywords that do not occur in the sample expressions, the system could automatically find examples in a corpus and present them to the user for annotation.

    c. Apply the rules to a test text corpus and evaluate the results. The application of the rules to a corpus can be automated entirely. An efficient way to implement this involves searching the text for sentences that contain the necessary keywords, parsing these sentences, and then filtering the parsed sentences according to the co-occurrence rules. The evaluation of the results is generally performed manually.

    d. Refine:

        i. If coverage is not yet sufficient, identify additional keywords (and/or remove keywords) and repeat from step b). This may be done by hand by examining sentences retrieved in step c); the system may also make suggestions automatically by identifying words that frequently occur in syntactic relations with known keywords.

        ii. If precision is not sufficient, refine co-occurrence rules and repeat from step c). The system may prompt the user to mark sentences that were retrieved but should not be, indicating which rule was responsible for the selection of each such sentence, and allowing the user to edit that rule.
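The automatic suggestion mentioned in step d(i), proposing words that frequently occur in syntactic relations with known keywords, may be sketched as a simple frequency count over parser output. The triple format, the function name, and the example words are hypothetical.

```python
from collections import Counter


def suggest_keywords(dependencies, known_keywords, top_n=5):
    """Rank unknown words by how often they are syntactically linked to a known keyword.

    dependencies: iterable of (relation, head, dependent) triples from a parser.
    """
    counts = Counter()
    for _relation, head, dependent in dependencies:
        if head in known_keywords and dependent not in known_keywords:
            counts[dependent] += 1
        elif dependent in known_keywords and head not in known_keywords:
            counts[head] += 1
    return [word for word, _count in counts.most_common(top_n)]


# Hypothetical parser output and current keyword list.
known = {"use", "device", "customer"}
deps = [
    ("S2", "use", "sensor"),
    ("S1", "enable", "device"),
    ("S2", "enable", "customer"),
    ("S3", "enable", "use"),
    ("S4", "device", "sensor"),
]
suggestions = suggest_keywords(deps, known)
```

The candidates would then be presented to the user, who decides whether each belongs to an existing constituent notion.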

The exemplary system disclosed herein may be developed with a small, manually-collected set of documents, such as press releases, in which the expressions that convey the target concept are identified. The concept of possible use is used here as an example. It will be appreciated that similar systems may be developed to search for text relating to other complex concepts.

The following are examples of sentences that the system may be adapted to detect:

1. Applications of AlInGaP LEDs, now centering on outdoor displays and car interior trim, are expected to broaden to include illumination, traffic light, and car exterior trim.

2. Designed for the retail, hospitality and architectural industries, the package consolidates and delivers multiple lighting processes in one system.

3. Developers immediately benefit from features such as viewing structural elements without constantly reverting to its source code.

The Concept Matching Module

The concept matching module 14 can be built on the conceptual framework described in Sándor and Kaplan, the disclosure of which is totally incorporated herein by reference. The framework allows a complex concept, in the exemplary embodiment the concept of possible use, to be described using a set of syntactically heterogeneous keyword classes, and co-occurrence rules that use syntax.

With reference once more to FIG. 1, the concept matching module 14 can access or include the memory 18 which stores the preselected keywords 20, the constituent notion(s) 22 with which each keyword is associated, and the set of rules 24, in computer readable form. The illustrated concept matching module 14 includes processing software, including: a component 30 which, using the keywords stored in the memory, labels keywords in the text corpus with the names of the semantic classes to which they belong; a component 34 which labels pairs of keywords of selected semantic classes which are in any one of a plurality of syntactic relationships; and a component 36 which identifies documents which include a labeled pair of keywords 20. A user input 38, such as a keyboard or touch screen, and an associated display 40, such as a screen, allow a user to interact with the concept matching module. The concept matching module 14 accesses a parser 42 which identifies the syntactic relationships in text retrieved from the text corpus which includes keywords.

For each document, rules can be applied at four levels:

1. word level: keywords belonging to certain semantic classes are labeled;

2. word pair level: pairs of syntactically linked keywords are labeled, based on the labels of the individual words;

3. sentence level: sentences are labeled, based on the labels of the keywords and word pairs they contain; and

4. document level: a label is assigned to the document based on the set of sentence labels it contains.

Following the methodology described in Sándor and Kaplan discussed above, the concept of possible use can be broken down into constituent notions, and lists of words that express those constituent notions compiled. On the basis of a manually collected set of example sentences, the following seven constituent notions can be identified (TABLE 1). Each constituent notion is presented with a few examples of words that express it.

TABLE 1

CONSTITUENT NOTION   BRIEF DESCRIPTION                                    EXEMPLARY KEYWORDS
Future               reference to future when the product will be used   introduce, available, future, promising
Use                  reference to uses of devices                        employ, fulfill, feature
User                 reference to the user of the new product            customer, demand, industry
Produce              reference to the production of the new product      create, release, design
Producer             reference to the producer of the new product        company, deliver, manufacturer, offer
Product              reference to the new product                        achievement, application, device
Possibility          express possibility                                 can, expect, ideally

In general, different constituent notions are used to describe different concepts. Once a paraphrase (or definition) of the concept is developed using a set of example documents, a constituent notion may be defined for each word in the paraphrase. In the example of possible use, a simple paraphrase sheds light on the motivation for the choice of its constituent notions: the concept of possible use involves a user that in the future will have the possibility to use a product produced by a producer. It will be appreciated that a possible use expression need not contain all seven constituent notions. The rules presented below describe which combinations of constituent notions may be considered sufficient to indicate a possible use.

With reference once more to FIG. 2, the first level of processing (Step S100) is thus to find occurrences in the text of words from the above seven classes, in order to assign the appropriate class labels to those occurrences (Step S120). The keywords associated with the constituent notions can be automatically expanded by the parser or other keyword search program to include words with the same root as those specifically enumerated (Substep S100A). The linguistic tool used for finding the roots of words may be a component of the parser 42 used for development of the concept matching system.
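
The word-level labeling described above can be sketched as follows. This is a minimal illustration only, not the disclosed implementation: the keyword lists are just the exemplary entries of TABLE 1, `crude_root` is a hypothetical suffix stripper standing in for the root-finding tool of the parser 42, and each root is mapped to a single class, although a keyword may in principle belong to more than one class.

```python
import re

# Exemplary keywords from TABLE 1; a working lexicon would be much larger.
KEYWORDS = {
    "FUTURE": ["introduce", "available", "future", "promising"],
    "USE": ["employ", "fulfill", "feature"],
    "USER": ["customer", "demand", "industry"],
    "PRODUCE": ["create", "release", "design"],
    "PRODUCER": ["company", "deliver", "manufacturer", "offer"],
    "PRODUCT": ["achievement", "application", "device"],
    "POSSIBILITY": ["can", "expect", "ideally"],
}


def crude_root(word):
    """Hypothetical stand-in for a real root finder: strips one common suffix."""
    word = word.lower()
    for suffix, replacement in (("ies", "y"), ("ing", ""), ("ed", ""), ("es", ""), ("s", "")):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)] + replacement
            break
    # Drop a trailing "e" so that "release" and "released" share a root.
    if len(word) > 3 and word.endswith("e"):
        word = word[:-1]
    return word


# Index every keyword by its root (assumption: one class per root).
ROOT_TO_CLASS = {crude_root(k): cls for cls, kws in KEYWORDS.items() for k in kws}


def label_words(sentence):
    """Return (token, class label) for each token whose root matches a keyword root."""
    tokens = re.findall(r"[A-Za-z]+", sentence)
    return [(t, ROOT_TO_CLASS[crude_root(t)]) for t in tokens if crude_root(t) in ROOT_TO_CLASS]


example = ("Designed for the retail, hospitality and architectural industries, "
           "the package consolidates and delivers multiple lighting processes in one system.")
print(label_words(example))
```

With only the TABLE 1 exemplars, the sketch labels "Designed", "industries" and "delivers"; labels for other words of the sentence would require the fuller keyword lists.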

For example, in the following sentence drawn from the sample set of relevant documents, the constituent notions of PRODUCE, PRODUCT, PRODUCER, USER, USE are identified:

-   -   Designed for [PRODUCE] the retail, hospitality and architectural industries [USER], the package consolidates and delivers [PRODUCER] multiple lighting processes [USE] in one system [PRODUCT].

In a second level of processing (Step S120), each sentence is parsed, using a dependency parser 42 that outputs pairs of words that are syntactically related. An automated parser capable of this type of processing is described for example in U.S. Patent Application No. 2003/0074187, the disclosure of which is incorporated herein in its entirety by reference. For the example sentence given above, the parser yields (among other dependencies):

-   -   (retail, the)
    -   (consolidates, package)
    -   (system [PRODUCT], delivers [PRODUCER])
    -   (designed [PRODUCE], industries [USER])

indicating that “the” is syntactically related to “retail” (in this case “the” is the determiner of the noun phrase whose head is “retail”), “package” is related to “consolidates” (it is the verb's subject), and so forth.
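
As an illustrative sketch of how the parser output can be combined with the word-level labels, the following keeps only those dependency pairs in which both words carry a constituent notion label. The function name `labeled_pairs` and the plain tuple/dictionary representation are assumptions made for this example; the dependencies themselves would come from a parser such as parser 42.

```python
def labeled_pairs(dependencies, word_labels):
    """Keep dependency pairs whose two words both have a constituent notion label.

    dependencies: iterable of (word, word) pairs output by a dependency parser.
    word_labels:  dict mapping a word to its class label (word-level output).
    """
    return [
        ((a, word_labels[a]), (b, word_labels[b]))
        for a, b in dependencies
        if a in word_labels and b in word_labels
    ]


# Dependencies and labels taken from the example sentence above.
deps = [
    ("retail", "the"),
    ("consolidates", "package"),
    ("system", "delivers"),
    ("designed", "industries"),
]
labels = {"system": "PRODUCT", "delivers": "PRODUCER",
          "designed": "PRODUCE", "industries": "USER"}

for pair in labeled_pairs(deps, labels):
    print(pair)
```

Only the last two pairs survive, since "retail", "the", "consolidates" and "package" received no word-level label.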

A weighting may be applied to each syntactically related pair of words which each also express one of the constituent notions (Substep S130A). For example, the pair-level label strong or weak is assigned to each pair in which both words received constituent notion labels. Within each weighting type (weak and strong in the present embodiment), further refinements or subtypes may be defined. For example, the weak label may be further refined into a number of subtypes that correspond to different aspects of the complex concept of possible use. These subtypes may be used as input to more sophisticated rules in subsequent levels. Additionally, one or more rules may require the presence of two or more pairs from specified constituent notions.

In one embodiment, word label pairs that receive the label weak are specifically identified; other pairs receive the label strong by default. Exemplary pair-level rules, which list explicitly the word label pairs that receive the label weak, may be as shown in Table 2. In this specific example, pairs that match the patterns in Table 2 are assigned the pair-level label weak. The variable x can be instantiated by any label. A pair of words which both received word-level labels, and where the pair does not match any of these patterns, receives the label strong.

TABLE 2

PAIR
(product, product)
(product, producer)
(product, produce)
(keywords labeled weak, x)
(possibility, x)
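
The pair-level rules can be sketched as a small function, under two stated assumptions. First, the patterns are treated as unordered, which is consistent with the (developers_USER, possible_POSSIBILITY) example matching (possibility, x). Second, the pattern "(keywords labeled weak, x)" is interpreted as a set of individually flagged keywords, passed in here as the hypothetical parameter `weak_keywords`. The function name `pair_label` is likewise illustrative.

```python
# Unordered label combinations that receive the label "weak" (from Table 2).
WEAK_LABEL_SETS = (
    frozenset({"product"}),              # (product, product)
    frozenset({"product", "producer"}),  # (product, producer)
    frozenset({"product", "produce"}),   # (product, produce)
)


def pair_label(label_a, label_b, word_a="", word_b="", weak_keywords=frozenset()):
    """Return "weak" if the labeled pair matches a Table 2 pattern, else "strong"."""
    labels = frozenset({label_a.lower(), label_b.lower()})
    # (possibility, x): x may be any label.
    if "possibility" in labels:
        return "weak"
    # (keywords labeled weak, x): assumed to mean individually flagged keywords.
    if word_a.lower() in weak_keywords or word_b.lower() in weak_keywords:
        return "weak"
    if labels in WEAK_LABEL_SETS:
        return "weak"
    # Any other pair of labeled words is strong by default.
    return "strong"


print(pair_label("USER", "POSSIBILITY"))    # weak
print(pair_label("PRODUCE", "USER"))        # strong
print(pair_label("PRODUCT", "PRODUCER"))    # weak
```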

For example, given the sentence:

-   -   AMD is leading the way to pervasive 64-bit computing and NVIDIA is helping make it possible [POSSIBILITY] for developers [USER] to take advantage of this technology . . .

the parser yields the pair

-   -   (developers_USER, possible_POSSIBILITY)

which matches the pattern (possibility, x) and is thus assigned the label weak.

The third level of rules assigns each sentence a label, based on the keyword pair(s) it contains (Step S160). The label can be based on whether the sentence satisfies a minimum relevance criterion. If no weightings are applied, the minimum relevance criterion can be an integer representing a number of rules to be satisfied by the sentence in order for the sentence to be considered relevant. In one embodiment, the minimum number is one. Where weighting is applied to the rules, a more complex ranking scheme may be developed. For example, a sum over the rules of the product of the number of times each rule is satisfied and its weighting can be obtained, and a minimum acceptable value of this sum established. Optionally, a document is selected as relevant if it contains at least one possible use sentence (Step S170).
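
The weighted criterion can be illustrated with a short sketch. The rule identifiers, weights and threshold below are hypothetical values chosen for the example; unweighted operation is modeled by giving every rule a weight of one and a minimum of one.

```python
def sentence_score(rule_counts, rule_weights=None, default_weight=1.0):
    """Sum over rules of (times the rule is satisfied) x (weight of the rule)."""
    rule_weights = rule_weights or {}
    return sum(
        count * rule_weights.get(rule, default_weight)
        for rule, count in rule_counts.items()
    )


def is_relevant(rule_counts, rule_weights=None, minimum=1.0):
    """A sentence is relevant if its weighted score reaches the minimum value."""
    return sentence_score(rule_counts, rule_weights) >= minimum


# Hypothetical rule identifiers and weights.
counts = {"rule_a": 2, "rule_b": 1}
weights = {"rule_a": 0.5, "rule_b": 2.0}
print(sentence_score(counts, weights))          # 2 * 0.5 + 1 * 2.0 = 3.0
print(is_relevant({"rule_a": 1}))               # unweighted, minimum of one rule
```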

In one specific embodiment, only one label is relevant for the identification of possible uses: the label possible_use. This label is assigned to any sentence that contains at least one strong pair or at least two weak pairs.
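
A minimal sketch of this sentence-level rule, together with the optional document-level selection (at least one possible use sentence), might look as follows. The function names and the representation of a sentence as a list of pair-level labels are assumptions made for illustration.

```python
def label_sentence(pair_labels):
    """Return "possible_use" for at least one strong pair or at least two weak pairs."""
    strong = sum(1 for label in pair_labels if label == "strong")
    weak = sum(1 for label in pair_labels if label == "weak")
    return "possible_use" if strong >= 1 or weak >= 2 else None


def document_is_relevant(sentences):
    """A document is selected if any of its sentences is labeled possible_use."""
    return any(label_sentence(pairs) == "possible_use" for pairs in sentences)


print(label_sentence(["strong"]))          # possible_use
print(label_sentence(["weak"]))            # None
print(label_sentence(["weak", "weak"]))    # possible_use
print(document_is_relevant([["weak"], [], ["weak", "weak"]]))  # True
```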

Certain of the keywords may be considered to be of special importance. In one embodiment, a sentence in which an important keyword appears is immediately selected as a potential use sentence, irrespective of any further rules.

While the above scheme works primarily at the sentence level, it is also contemplated that rules may be developed that allow the detection of constituent notions defining a concept that span multiple sentences. This may involve propagating the subtypes of the weak label at the word pair level up to the document level, and writing document-level rules that indicate which combinations of these subtypes indicate a concept. The exemplary rules described herein may be best suited to short documents, such as press releases. More sophisticated document-level processing may be appropriate for longer documents.

The Ranking Module

The documents to be presented to the user can optionally be ordered by a ranking function (Step S180) intended to concentrate the most interesting articles near the top of the list. This ranking function may include one of two different sorts of information or a combination thereof:

-   -   1. Degree of confidence that the document describes a possible use of a new product
    -   2. Relevance of the document to the user's topic

The degree of confidence that the document describes a possible use is estimated from information returned by the concept-matching module. This can include one or more of: the identifier of the detection rule that was triggered (some rules may be more reliable than others), the number of different rules that were triggered by that document, and the total number of rules triggered (optionally modified by applying a weighting for some or all the rules).

Estimating the relevance of a document to a topic is a classical problem with well-understood solutions. Some document retrieval systems include relevance ranking functionality. If such a system is used as the document base component, then its built-in relevance score can be factored into the final ranking. If the interface to the document base does not perform relevance ranking, a separate relevance ranking module can be implemented. In one embodiment, the user provides this module with a list of user-selected keywords, and weightings for these keywords. The ranking module 16 assigns a relevance score that is the weighted sum of the number of occurrences of those keywords. The user-selected keywords 60 may be the same user-selected keywords initially used to narrow the search.
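
The weighted-sum relevance score can be sketched as below. The sketch assumes single-word, case-insensitive keywords and a simple alphanumeric tokenizer; the function name and the example keywords and weights are hypothetical.

```python
import re
from collections import Counter


def relevance_score(text, keyword_weights):
    """Weighted sum of the number of occurrences of each user-selected keyword."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return sum(weight * counts[kw.lower()] for kw, weight in keyword_weights.items())


# Hypothetical user-selected keywords and weightings.
weights = {"led": 2.0, "lighting": 1.0, "laser": 5.0}
print(relevance_score("LED lighting for LED displays", weights))  # 2*2.0 + 1*1.0 = 5.0
```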

In one embodiment, the ranking module 16 includes a component 48 for expanding the user selected keywords 60 through access to a thesaurus or other lexical resource. The thesaurus may be integral with the ranking module or accessed thereby. The component 48 may interact with the user via the user input 38 to select or deselect appropriate ones of the additional words proposed by the expansion component 48. The ranking module 16 may also include a component 50 for determining confidence and a component 52 for determining relevance, and then combines these two scores (relevance and degree of confidence) to produce a single ranking. Given a sufficient number of annotated examples, parameters for combining the scores can be estimated statistically, using a linear model for example. Other ranking systems are also contemplated.
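
One way the two scores might be combined is a linear model, sketched below. The coefficients `w_conf`, `w_rel` and `bias` are placeholders: in practice they would be estimated from annotated examples, and that estimation step is not shown. The function names and the tuple layout are assumptions made for this illustration.

```python
def combined_score(confidence, relevance, w_conf=0.5, w_rel=0.5, bias=0.0):
    """Linear combination of the confidence and relevance scores (placeholder weights)."""
    return w_conf * confidence + w_rel * relevance + bias


def rank_documents(scored_docs, **params):
    """Order document ids by combined score, highest first.

    scored_docs: iterable of (doc_id, confidence, relevance) tuples.
    """
    ordered = sorted(
        scored_docs,
        key=lambda doc: combined_score(doc[1], doc[2], **params),
        reverse=True,
    )
    return [doc_id for doc_id, _, _ in ordered]


docs = [("a", 0.2, 0.9), ("b", 0.8, 0.5)]
print(rank_documents(docs, w_conf=0.7, w_rel=0.3))  # ['b', 'a']
```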

The concept matching system can be viewed as a special-purpose information retrieval (IR) system. It is particularly suited to applications where mere keyword-based retrieval would not be effective, for example, because the expressions to be looked for tend to involve high-frequency, non-technical words that cause many false hits. By employing patterns of syntactically-linked keyword pairs rather than simply individual keywords, a higher recall is achieved without an unacceptable decline in precision.

To use the concept matching system on a database of documents, a user optionally inputs a set of one or more user selected ranking keywords 60 to the ranking module 16 which are applicable to the area of research in which the documents are sought (Step S200). These user-selected keywords are generally very broad terms intended to be inclusive, rather than selective. For example, the user selected keywords may include the name of a particular product class as well as various words known to be used in discussions of that product class, even if such words are not highly discriminative for documents about the product class. The operator of the system may be prompted to provide the set of user-selected keywords. The system may expand on the user selected keywords, using a parser, thesaurus or based on co-occurrence statistics, and optionally provide the operator/end user with the expanded set of keywords and prompt the operator to select appropriate ones to be used (Step S210). Relevant documents are identified automatically using the system keywords and rules, and optionally ranked as described above, for example, using the same or a different set of user-selected keywords by the relevance module. The retrieved documents can be composed into a listing, summary or other format desired by an end user. The system may be used periodically to update the search results.

Alternatively or additionally, a user inputs user-selected corpus limiting keywords 64, which may be the same or different from the user-selected keywords 60 (Step S220). Applying these user selected keywords 64 to the text corpus at an initial stage, rather than in a later, document ranking process, can thus reduce the search time for the identification of relevant documents. The user-selected keywords help to reduce the text corpus to be examined. The concept matching system then operates on a modified text corpus, as limited by the user-selected keywords.
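
The corpus-limiting step can be sketched as a simple pre-filter applied before concept matching. The sketch assumes that a document is retained if it contains at least one of the limiting keywords as a whole word; the function name, the dictionary representation of the corpus and the sample documents are hypothetical.

```python
import re


def limit_corpus(documents, limiting_keywords):
    """Keep only documents containing at least one corpus-limiting keyword.

    documents: dict mapping a document id to its text.
    """
    wanted = {kw.lower() for kw in limiting_keywords}
    return {
        doc_id: text
        for doc_id, text in documents.items()
        if wanted & set(re.findall(r"[a-z0-9]+", text.lower()))
    }


corpus = {
    "pr1": "New LED package for retail lighting.",
    "pr2": "Quarterly earnings were announced today.",
}
print(list(limit_corpus(corpus, ["LED", "lighting"])))  # ['pr1']
```

The concept matching rules would then be applied only to the reduced corpus returned by the filter.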

Present text mining systems are insufficient for identifying possible use, as the variety of expressions used to convey the possible use concept is so great that using only frequent n-grams would result in extremely low coverage. The more flexible pattern formalism provided by the present system can detect expressions that were unseen in the development corpus.

The concept matching system has application for a variety of domains. Although the exemplary system was developed using a test set of documents from press releases, the same system, including keywords, constituent notions and rules, can be used to find documents from another database related to the same complex concept.

The following non-limiting example demonstrates the effectiveness of the concept matching system.

It will be appreciated that various of the above-disclosed and other features and functions, or alternatives thereof, may be desirably combined into many other different systems or applications. Also that various presently unforeseen or unanticipated alternatives, modifications, variations or improvements therein may be subsequently made by those skilled in the art which are also intended to be encompassed by the following claims.

1. A system for retrieving documents related to a concept from a text corpus comprising: a memory which stores: a set of semantic classes which are combinable to express the concept, a set of keywords for each of the semantic classes to be used in searching documents in the text corpus, each set of keywords including at least one keyword, and a plurality of syntactic rules to be applied to identified text portions which include one or more of the keywords, each of the syntactic rules identifying a first of the semantic classes and a second of the semantic classes, the rule being satisfied when a keyword from the first of the semantic classes and a keyword from the second of the semantic classes are in any one of a plurality of syntactic relationships; and a concept matching module, which accesses the memory, for identifying text portions within the text corpus which include one or more of the keywords and for applying the syntactic rules to the text portions and identifying those text portions which satisfy at least one of the syntactic rules, whereby documents are retrieved which include at least one of the identified text portions.
2. The system of claim 1, wherein the syntactic relationships include at least two of the following pairs, where the first of the semantic classes includes a keyword in one of the pair and the second of the semantic classes includes a keyword in the other of the pair: a verb and its subject; a verb and its object; one of an adjective and an adverb and one of a noun, verb and adjective that the adjective or adverb modifies; a noun and another noun that the noun premodifies; and a noun and one of a noun and verb that the noun modifies by prepositional attachment.
3. The system of claim 1, wherein the plurality of syntactic relationships comprises at least four syntactic relationships.
4. The system of claim 1, wherein each of the plurality of syntactic rules applies the same set of syntactic relationships.
5. The system of claim 1, wherein a plurality of the semantic classes each include a plurality of keywords.
6. The system of claim 5, wherein at least four of the semantic classes each include at least five keywords.
7. The system of claim 1, wherein the plurality of semantic classes comprises at least three semantic classes.
8. The system of claim 1, wherein there are fewer than about ten semantic classes.
9. The system of claim 1, wherein a plurality of the semantic classes correspond to constituent notions selected from the group consisting of: use, user, produce, producer, product, future, possibility.
10. The system of claim 1, wherein at least one of the semantic classes includes a keyword of another of the semantic classes.
11. The system of claim 1, wherein at least one of the syntactic rules identifies a third of the semantic classes and a fourth of the semantic classes, the rule being satisfied when a keyword from the first of the semantic classes and a keyword from the second of the semantic classes are in any one of the plurality of syntactic relationships and a keyword from the third of the semantic classes and a keyword from the fourth of the semantic classes are in any one of the plurality of syntactic relationships.
12. The system of claim 1, further comprising: a ranking module which ranks retrieved text according to one or more ranking rules.
13. The system of claim 12, wherein the ranking rules account for at least one of: a number of times a syntactic rule is satisfied in a text passage; a weighting of a syntactic rule; and a number of occurrences of keywords from a set of keywords which are considered to express the concept irrespective of the existence of a syntactic relationship.
14. The system of claim 1, wherein at least some of the semantic classes include keywords which are not synonymous with other keywords in the same semantic class.
15. The system of claim 1, further comprising a parser, accessed by the concept matching module, which identifies the plurality of syntactic relationships in text and the syntactic relationships optionally comprise all the syntactic relationships that the parser identifies.
16. The system of claim 1, wherein the computer implemented means applies a weighting to one or more of the syntactic rules and identifies text portions within the text corpus which satisfy a minimum weighted sum of the syntactic rules occurring in the text portion.
17. The system of claim 1, wherein the concept matching module includes: a component which labels selected keywords in the text corpus; a component which associates the labeled keywords with a semantic class, each of the semantic classes including a plurality of the keywords; a component which labels pairs of keywords of selected semantic classes which are in any one of a plurality of syntactic relationships; and a component which identifies documents which include a labeled pair of keywords.
18. A computer implemented system for retrieving documents related to a concept from a text corpus comprising: a component which labels selected keywords in the text corpus; a component which associates the labeled keywords with a semantic class, each of the semantic classes including a plurality of the keywords; a component which labels pairs of keywords of selected semantic classes which are in any one of a plurality of syntactic relationships; and a component which identifies documents which include a labeled pair of keywords.
19. A method for retrieving documents related to a concept from a text corpus comprising: labeling keywords in the text corpus which belong to at least one of a plurality of predefined semantic classes; labeling pairs of labeled keywords which are in any one of a plurality of syntactic relationships and which meet one of a plurality of predefined syntactic rules; and labeling documents which include at least one labeled pair; and retrieving at least a portion of the documents which include at least one labeled pair.
20. The method of claim 19, further comprising: labeling sentences in the text corpus which include at least one labeled pair of labeled keywords.
21. The method of claim 19, further comprising: using user-selected keywords to identify the text corpus within a larger text corpus.
22. The method of claim 19, further comprising: the labeling pairs of labeled keywords which are in any one of the plurality of syntactic relationships including employing a parser which identifies pairs of words which are in any one of a plurality of predefined syntactic relationships.